Abstract

Vector quantization-based approaches have been successful in solving Approximate Nearest Neighbor (ANN) problems, which are critical to many applications. The idea is to generate effective encodings that allow fast distance approximation. We propose that quantization-based methods should partition the data space finely and exhibit the locality of the dataset to allow efficient non-exhaustive search. In this paper, we introduce the concept of High Capacity Locally Aggregating Encodings (HCLAE) to this end, and propose Dictionary Annealing (DA) to learn HCLAE by a simulated annealing procedure. The resulting quantization error is lower than that of other state-of-the-art methods.

The DA algorithms can be easily extended to an online learning scheme, allowing effective handling of large-scale data. Further, we propose the Aggregating Tree (A-Tree), a non-exhaustive search method that uses HCLAE to perform efficient ANN search. A-Tree achieves orders-of-magnitude speed-ups on ANN search tasks compared to the state of the art.

Introduction 

Approximate nearest neighbor (ANN) search is a fundamental problem in many areas of computer science, especially those involving high-dimensional and large-scale datasets, such as machine learning, pattern recognition, computer vision, and information retrieval, because of their demanding computational efficiency requirements. Among existing ANN techniques, quantization-based algorithms (e.g., (Jegou, Douze, and Schmid 2011), (Ge et al. 2013), (Ting Zhang 2014)) have shown state-of-the-art performance by allowing efficient asymmetric distance computation (ADC) (Jegou, Douze, and Schmid 2011) between a query vector and an encoded vector. One can perform an exhaustive ADC to retrieve the approximate nearest neighbor.

Even so, an exhaustive comparison between the query and the dataset is still prohibitive for even larger datasets such as (Torralba, Fergus, and Freeman 2008). IVFADC (Jegou, Douze, and Schmid 2011) provides non-exhaustive search based on coarse quantizers and encoded residues. The idea is to obtain a candidate list that possibly contains the nearest neighbor, then perform ADC on the list. Similar methods, such as the inverted multi-index (Babenko and Lempitsky 2012), Locally Optimized Product Quantization (Kalantidis and Avrithis 2014), and Joint Inverted Indexing (Xia et al. 2013), offer various improvements.

Problems of existing quantization-based algorithms. 

One challenge in designing non-exhaustive search algorithms is that the locality of a vector is not exhibited in its encoding. Researchers therefore have to recover the locality in a roundabout way, for example with a coarse quantizer. These methods lack efficiency because candidate listing and re-ranking are unrelated to each other. In addition, we would like the encodings to have high capacity w.r.t. the data space, i.e., to distinguish more vectors, so that the data space can be effectively represented. However, existing quantization methods do not explicitly consider these issues.

Major Contributions. In this paper, we are interested in encodings that not only accelerate distance computation, but also "aggregate" the locality of a dataset while having high capacity. We introduce the concept of High Capacity Locally Aggregating Encodings (HCLAE) for ANN search to address the aforementioned problems. We propose the Dictionary Annealing (DA) algorithm to generate HCLAE encodings of the dataset. Inspired by simulated annealing, the main idea of DA is to "heat up" a dictionary with the current residue, then "cool down" the dictionary to reduce the residue. Auxiliary algorithms for DA are also introduced to further increase capacity and reduce distortion. DA is naturally an online learning algorithm and is suitable for large-scale learning.

To utilize HCLAE encodings on large-scale data, we propose the Aggregating Tree (A-Tree) for fast non-exhaustive search. It is a radix-tree-like structure built on the encodings of the dataset, so the common prefixes of the encodings can be represented with a single node. A-Tree is memory efficient and allows fast non-exhaustive search: we traverse the tree breadth-first with a priority queue to obtain the candidate list. The time consumption is significantly lower than that of other non-exhaustive search methods.

We have validated DA and A-Tree on various standard benchmarks: SIFT1M, GIST1M (Jegou, Douze, and Schmid 2011), and SIFT1B (Jegou et al. 2011). Empirical results show that DA greatly improves the quantization of the dataset, and that A-Tree brings orders-of-magnitude speed-ups compared to existing non-exhaustive search methods. The overall performance of DA and A-Tree exceeds that of existing state-of-the-art methods. The online version is also of great practical interest. Applications that depend on ANN search can greatly benefit from our algorithms.

Figure 1: Mutual information matrices of local vectors' encodings on the GIST1M dataset, M = 8, K = 256, indicating the Locally Aggregating property of different methods. Panels: (a) AQ, (b) OPQ, (c) RVQ, (d) DA.

Figure 2: Mutual information matrices of all encoded vectors on the GIST1M dataset, M = 8, K = 256, indicating the Encoding Capacity of different methods. Panels: (a) AQ, (b) OPQ, (c) RVQ, (d) DA. Best viewed in color.

Background and Motivation 

The main idea of quantization-based methods is to generate encodings consisting of $M$ parts for fast distance computation. For example, Product Quantization (PQ) (Jegou, Douze, and Schmid 2011) splits the data space into $M$ disjoint subspaces, learns a separate dictionary for each subspace, and then quantizes each subspace to produce the encoding $\{i_1(x), i_2(x), \dots, i_M(x)\}$ of a vector $x$. PQ allows fast approximate distance computation between a query vector and an encoded vector via Asymmetric Distance Computation (ADC), which is discussed in detail in (Jegou, Douze, and Schmid 2011), (Babenko and Lempitsky 2015), and (Babenko and Lempitsky 2014).
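To make the PQ encoding and the ADC lookup concrete, the following is a minimal sketch in plain Python. The helper names (`pq_encode`, `adc_tables`, `adc_distance`) and the tiny hand-written dictionaries are illustrative assumptions, not code from the cited papers; real dictionaries would be learned by K-means on each subspace.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length sequences."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def split(x, dictionaries):
    """Yield (m, sub-vector of x) for each of the M disjoint subspaces."""
    start = 0
    for m, words in enumerate(dictionaries):
        d_sub = len(words[0])
        yield m, x[start:start + d_sub]
        start += d_sub


def pq_encode(x, dictionaries):
    """Return the code {i_1(x), ..., i_M(x)}: nearest word index per subspace."""
    return [min(range(len(dictionaries[m])),
                key=lambda k: sq_dist(sub, dictionaries[m][k]))
            for m, sub in split(x, dictionaries)]


def adc_tables(q, dictionaries):
    """Per-query lookup tables: distance from each query sub-vector to each word."""
    return [[sq_dist(sub, w) for w in dictionaries[m]]
            for m, sub in split(q, dictionaries)]


def adc_distance(tables, code):
    """Approximate squared distance to an encoded vector: M table lookups."""
    return sum(tables[m][k] for m, k in enumerate(code))


# Hypothetical toy setup: M = 2 subspaces of dimension 2, K = 2 words each.
dictionaries = [[[0.0, 0.0], [1.0, 1.0]],
                [[0.0, 0.0], [2.0, 2.0]]]
code = pq_encode([0.9, 1.2, 0.1, -0.1], dictionaries)
tables = adc_tables([1.0, 1.0, 0.0, 0.0], dictionaries)
print(code, adc_distance(tables, code))
```

The tables are built once per query, so each database vector costs only $M$ lookups and additions, which is what makes exhaustive ADC feasible on moderately sized datasets.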

However, in real applications involving large-scale data, exhaustively computing distances does not meet the query speed requirement. It is practical to perform some preprocessing such as candidate listing. IVFADC (Jegou, Douze, and Schmid 2011), the Inverted Multi-index (Babenko and Lempitsky 2012), Locally Optimized Product Quantization (Kalantidis and Avrithis 2014), etc. have been proposed for this task. However, these candidate listing methods are unrelated to the encodings of the dataset, which adds computation and storage cost.

Locally Aggregating Encodings 

A common methodology for non-exhaustive search is branch-and-bound with trees. The effectiveness of a tree structure lies in how well it can tell which child node contains the nearest neighbor. However, in high-dimensional space, tree structures like the KD-Tree (Friedman, Bentley, and Finkel 1977) generally degrade to a linear scan because the nearest neighbor may be contained in any node (Weber, Schek, and Blott 1998). To utilize the branch-and-bound methodology, it must be possible to narrow down the search scope.

Our solution is to utilize the priors of the visited node: if a node is deep in the tree, then we know which child node may contain the nearest neighbor. We call this property Locally Aggregating. Note that one can transform encodings into a radix tree. Denote the $m$-th part of the encoding of a local vector $x'$ of a vector $x$ as $I_m = i_m(x')$, and $C_m$ as the conditional entropy:

$$C_m = H(I_m \mid I_1, I_2, \dots, I_{m-1})$$

$C_m$ directly measures to what extent we can narrow down the search scope, so a fast-descending $C_m$ is preferred. Directly computing $C_m$ is not easy; nevertheless, we present the mutual information matrix of $I_m$ obtained with different quantization methods in Figure 1 for visualization.
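As a rough illustration of what $C_m$ measures, the sketch below computes a plug-in (empirical) estimate of the conditional entropy from a small list of codes, using the chain rule $H(I_m \mid I_1, \dots, I_{m-1}) = H(I_1, \dots, I_m) - H(I_1, \dots, I_{m-1})$. The function names are hypothetical, and such a naive estimate is only meaningful when the sample is large relative to the number of distinct prefixes, which is one reason the quantity is hard to compute directly in practice.

```python
import math
from collections import Counter


def entropy(counter):
    """Shannon entropy in bits of the empirical distribution in a Counter."""
    n = sum(counter.values())
    return -sum(c / n * math.log2(c / n) for c in counter.values())


def conditional_entropy(codes, m):
    """Empirical H(I_m | I_1, ..., I_{m-1}) in bits; m is 1-based."""
    joint = Counter(tuple(c[:m]) for c in codes)
    prefix = Counter(tuple(c[:m - 1]) for c in codes)
    return entropy(joint) - entropy(prefix)


# Codes of the local vectors around some point (toy data, M = 2).
aggregated = [(0, 0), (0, 0), (1, 1), (1, 1)]   # prefix determines the next part
scattered = [(0, 0), (0, 1), (1, 0), (1, 1)]    # prefix tells nothing about it
print(conditional_entropy(aggregated, 2), conditional_entropy(scattered, 2))
```

In the first toy set the second part is fully determined by the first, so the estimate drops to zero at $m = 2$; in the second set it stays at one bit.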

Encoding Capacity 

To effectively encode a dataset, we would like the data space to be partitioned finely, so that vectors can be easily distinguished. It is straightforward to define the Encoding Capacity as the total information entropy $S = H(I_1, I_2, \dots, I_M)$. In practice, optimizing the encoding capacity is usually relaxed into two separate objectives:

1. Maximize the self-information $H(I_m)$ for $m = 1, \dots, M$

2. Minimize the mutual information between $I_i$ and $I_j$ for $i \neq j$

The above objectives were explicitly considered in hashing methods, including Spectral Hashing (Weiss, Torralba, and Fergus 2009) and Semi-supervised Hashing (Wang, Kumar, and Chang 2010), which were proposed to learn balanced and uncorrelated bits. For quantization methods, encoding capacity has not been addressed yet. In Figure 2, we compare the encoding capacities of different quantization methods by visualizing their mutual information matrices.
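The two relaxed objectives can be read off a mutual information matrix of the kind visualized in the figures. The following sketch, under the assumption that codes are given as equal-length tuples, computes an empirical version: the diagonal holds the self-information $H(I_m)$ and the off-diagonal entries hold the pairwise mutual information. The function name `mutual_information_matrix` is illustrative and not taken from the paper.

```python
import math
from collections import Counter


def entropy(samples):
    """Shannon entropy in bits of the empirical distribution of `samples`."""
    counts = Counter(samples)
    n = len(samples)
    return -sum(c / n * math.log2(c / n) for c in counts.values())


def mutual_information_matrix(codes):
    """M x M matrix: H(I_a) on the diagonal, MI between I_a and I_b elsewhere."""
    M = len(codes[0])
    cols = [[c[m] for c in codes] for m in range(M)]
    h = [entropy(col) for col in cols]
    mi = [[0.0] * M for _ in range(M)]
    for a in range(M):
        for b in range(M):
            joint = entropy(list(zip(cols[a], cols[b])))
            mi[a][b] = h[a] + h[b] - joint   # equals h[a] when a == b
    return mi


# Toy codes with M = 3: parts 0 and 1 are independent, part 2 copies part 0.
codes = [(0, 0, 0), (0, 1, 0), (1, 0, 1), (1, 1, 1)]
for row in mutual_information_matrix(codes):
    print(row)
```

A high-capacity encoding in the sense above would show a bright diagonal (large self-information) and dark off-diagonal entries (small mutual information).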

Learning High Capacity Locally Aggregating 
Encodings (HCLAE) 

As described above, for a high-capacity encoding, $H(I_m)$ is maximized. By the chain rule, $C_m = H(I_1, \dots, I_m) - H(I_1, \dots, I_{m-1})$; to lower it, $H(I_1, \dots, I_m)$ should be minimized, i.e., the local vectors should share the same prefix encoding. By Lloyd's condition (Gray 1984), we could perform Residual Vector Quantization (RVQ) (Juang and Gray Jr 1982) (Chen, Guan, and Wang 2010) on the dataset. However, for high-dimensional data, the encoding capacity of RVQ is low and its encodings do not exhibit the locally aggregating property. We introduce Dictionary Annealing to produce High Capacity Locally Aggregating Encodings.

Dictionary Annealing 

Dictionary Annealing (DA) performs simulated annealing on a series of existing dictionaries, and it can also learn dictionaries from scratch. Figure 3 provides an intuitive illustration of DA. To optimize a single dictionary $\mathcal{C}_m = \{c_m(1), \dots, c_m(K)\}$, let us assume it is already at a local lowest-energy position, i.e., it no longer improves under the previous optimization/learning. We first "heat up" the dictionary by putting the "noisy" residue $\{e_x\}$ into $\mathcal{C}_m$ to generate an intermediate dataset:

$$x' = e_x + c_m(i_m(x))$$

Then we "cool down" the dictionary by incrementally fitting it to $\{x'\}$.

Figure 3: Illustration of one iteration of Dictionary Annealing. We perform several iterations to optimize the whole series of dictionaries.

Why the intermediate dataset, and why use residues? We have two reasons:

• The intermediate dataset is the residue obtained when the dictionary currently being optimized is dropped. The quantization error is reduced if the intermediate dataset is fitted better. Looking at the whole picture, the $m$-th dictionary does a better job and the residue left for the next dictionary is smaller, lowering $H(I_1, \dots, I_m)$.

• The residues are independent of the other dictionary spaces, as they are "noise" to these dictionaries. Manipulating the residues will not raise the mutual information between dictionaries, so we can push $H(I_m)$ higher without worry.

Given a series of dictionaries, the algorithm is performed over multiple iterations. In each iteration, we optimize one dictionary, then re-encode the dataset to obtain the new residue for the next iteration. To learn dictionaries from scratch, one can simply perform DA on "all-zeros" dictionaries.¹ To achieve better performance, we propose the following two auxiliary algorithms:

¹In this case DA is quite similar to Residual Vector Quantization: the intermediate dataset of an "all-zeros" dictionary is the same as the residues.
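The heat-up and cool-down steps of one DA iteration can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: encoding is done greedily part by part (the paper uses multi-path encoding), the cool-down step is plain Lloyd K-means initialized with the current dictionary (the paper uses an improved K-means), and all function names are hypothetical.

```python
def sq_dist(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def nearest(words, v):
    """Index of the dictionary word closest to v."""
    return min(range(len(words)), key=lambda k: sq_dist(v, words[k]))


def encode(x, dictionaries):
    """Greedy residual encoding (an assumption); returns (code, residue e_x)."""
    code, residue = [], list(x)
    for words in dictionaries:
        k = nearest(words, residue)
        code.append(k)
        residue = [r - w for r, w in zip(residue, words[k])]
    return code, residue


def intermediate_dataset(data, dictionaries, m):
    """Heat up: x' = e_x + c_m(i_m(x)) for every x in the dataset."""
    out = []
    for x in data:
        code, residue = encode(x, dictionaries)
        word = dictionaries[m][code[m]]
        out.append([e + w for e, w in zip(residue, word)])
    return out


def kmeans(points, init, iters=10):
    """Plain Lloyd iterations started from the words in `init`."""
    words = [list(w) for w in init]
    for _ in range(iters):
        groups = [[] for _ in words]
        for p in points:
            groups[nearest(words, p)].append(p)
        for k, g in enumerate(groups):
            if g:  # an empty cluster keeps its old word
                words[k] = [sum(col) / len(g) for col in zip(*g)]
    return words


def da_iteration(data, dictionaries, m):
    """One iteration: heat up dictionary m, then cool it down on {x'}."""
    inter = intermediate_dataset(data, dictionaries, m)
    updated = list(dictionaries)
    updated[m] = kmeans(inter, dictionaries[m])
    return updated


def distortion(data, dictionaries):
    """Mean squared norm of the residue after encoding."""
    return sum(sum(e * e for e in encode(x, dictionaries)[1])
               for x in data) / len(data)


data = [[0.0], [0.2], [10.0], [10.2]]
dictionaries = [[[0.0], [10.0]]]
improved = da_iteration(data, dictionaries, 0)
print(distortion(data, dictionaries), distortion(data, improved))
```

With a single dictionary the iteration reduces to ordinary K-means, and with an all-zeros dictionary the intermediate dataset equals the residues, as the footnote notes. Note that identical all-zeros words do not separate under plain K-means, so learning from scratch would need a different initialization than this sketch provides.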


Improved K-means for High-dimensional Residue 

Clustering in a high-dimensional space is not easy, especially for high-dimensional residues, as their randomness is increased. To obtain a better clustering of high-dimensional data, one approach is to cluster in a lower-dimensional subspace (Agrawal et al. 1998), which is also done by PQ/OPQ to obtain high information entropy for each dictionary. (Ding and He 2004) indicates that PCA dimension reduction is particularly beneficial for K-means clustering, as it finds the best low-rank L2 approximations. In addition, the previously learned dictionary can provide sufficiently good initial points, which is important for K-means clustering (Bradley and Fayyad 1998).

Our idea is to preserve the clustering information of the lower-dimensional subspace for clustering in the higher-dimensional subspace. To optimize the dictionary $\mathcal{C}_m$ for $\{x'\}$, we first designate a dimension-adding sequence $d_1 < d_2 < \dots < d_l = d$, then:

1. Project $\mathcal{C}_m$ and $\{x'\}$ into the PCA space of $\{x'\}$ with rotation $R$, obtaining the rotated dictionary $\mathcal{C}'_m = \{c'_m(k) = R c_m(k), k = 1, \dots, K\}$ and the rotated intermediate dataset $\{x'' = R x'\}$.

2. Optimize $\mathcal{C}'_m$ by performing K-means on $\{x''\}$, initialized with $\mathcal{C}'_m$, using only the top $d_1$ dimensions of the data, then the top $d_2$, next $d_3$, and so on, and finally all $d_l = d$ dimensions.

3. Rotate back to finish the optimization: $c_m(k) = R^{-1} c'_m(k)$.

The choice of $d_1, \dots, d_l$ has only a minor effect on the optimization of a dictionary. We choose di = d^!^ = 10 in our experiments.
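Step 2 of the procedure above can be sketched as a K-means loop whose assignment step looks at a growing number of leading dimensions. The sketch assumes the points and the initial words have already been rotated into PCA space (steps 1 and 3 are omitted), and it makes one further assumption that the text does not spell out: centroids are updated in all dimensions at every stage, so the not-yet-used dimensions simply carry the cluster means. The name `progressive_kmeans` is hypothetical.

```python
def sq_dist_top(a, b, d_top):
    """Squared distance restricted to the first d_top dimensions."""
    return sum((a[i] - b[i]) ** 2 for i in range(d_top))


def progressive_kmeans(points, init_words, dim_sequence, iters=5):
    """K-means whose assignments use the top d_1, then d_2, ... dimensions.

    `points` and `init_words` are assumed to be PCA-rotated already, with
    dimensions ordered by decreasing variance.
    """
    words = [list(w) for w in init_words]
    for d_top in dim_sequence:            # d_1 < d_2 < ... < d_l = d
        for _ in range(iters):
            groups = [[] for _ in words]
            for p in points:
                k = min(range(len(words)),
                        key=lambda j: sq_dist_top(p, words[j], d_top))
                groups[k].append(p)
            for k, g in enumerate(groups):
                if g:  # an empty cluster keeps its old word
                    words[k] = [sum(col) / len(g) for col in zip(*g)]
    return words


points = [[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]]
init = [[1.0, 0.5], [9.0, 0.5]]
print(progressive_kmeans(points, init, dim_sequence=[1, 2]))
```

Because each stage starts from the words of the previous stage, the grouping found on the high-variance leading dimensions is carried over as the initialization for the fuller-dimensional clustering.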

Multi-path Encoding 

To encode with DA dictionaries, we seek the code that minimizes the quantization error $E$ for an input vector $x$:

$$
\begin{aligned}
E &= \Big\| x - \sum_{m=1}^{M} c_m(i_m(x)) \Big\|^2 \\
  &= \sum_{m=1}^{M} \| x - c_m(i_m(x)) \|^2 - (M - 1)\|x\|^2 + \sum_{a=1}^{M} \sum_{b=1, b \neq a}^{M} c_a(i_a(x))^{\top} c_b(i_b(x)) \qquad (1)
\end{aligned}
$$

The above minimization is a typical fully connected MRF problem. Although the optimization of $E$ can be solved approximately by various existing algorithms, they are very time consuming (Babenko and Lempitsky 2015).
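The expansion in Equation 1 can be checked numerically. The short sketch below evaluates both sides for a hand-picked vector and $M = 3$ selected dictionary words; the function names and the numbers are illustrative only.

```python
def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def error_direct(x, words):
    """||x - sum_m c_m||^2 computed from the reconstruction."""
    recon = [sum(col) for col in zip(*words)]
    diff = [xi - ri for xi, ri in zip(x, recon)]
    return dot(diff, diff)


def error_expanded(x, words):
    """Right-hand side of Equation 1: unary terms, constant, pairwise terms."""
    M = len(words)
    unary = 0.0
    for c in words:
        diff = [xi - ci for xi, ci in zip(x, c)]
        unary += dot(diff, diff)
    pairwise = sum(dot(words[a], words[b])
                   for a in range(M) for b in range(M) if a != b)
    return unary - (M - 1) * dot(x, x) + pairwise


x = [1.0, 2.0]
words = [[0.5, 0.5], [0.25, 1.0], [0.0, 0.25]]   # the selected c_m(i_m(x))
print(error_direct(x, words), error_expanded(x, words))
```

The unary terms depend on one dictionary each, while the pairwise inner products couple every pair of dictionaries, which is what makes the exact minimization a fully connected problem.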

Similar to the concept of Locally Aggregating Encoding, if an oracle gives us the correct first $m - 1$ encodings, can we effectively tell the correct encoding of the $m$-th part? Denote the correct encoding of an input vector as $x \approx c_1(i_1) + c_2(i_2) + \dots + c_M(i_M)$, and the known $m - 1$ correct encodings as $i_1, i_2, \dots, i_{m-1}$. We consider the quantization error $E$ as a function of $i_m$ (Equation 2), where $\hat{x} = c_1(i_1) + \dots + c_{m-1}(i_{m-1})$ and $\tilde{x} = c_{m+1}(i_{m+1}) + \dots + c_M(i_M)$.

We seek the best $i_m$ among $1, \dots, K$ to minimize $E$. In Equation 2, terms 1/2/3/7 are constant and negligible, and terms 4/5 can be computed. Only the 6th term cannot be computed, because we do not know $\tilde{x}$. We want it to be small so that it will not seriously affect the final outcome.

Thus we rearrange the dictionaries in descending order of the variance of their elements. Note that DA learned from scratch naturally produces dictionaries with descending variance. We further adopt beam search to encode a vector. That is, we maintain a list of the best $L$ approximations of $x$ on the first $m - 1$ dictionaries: $\hat{x}^1, \dots, \hat{x}^L$. Then we encode with the next dictionary $\mathcal{C}_m = \{c_m(1), c_m(2), \dots, c_m(K)\}$. We find the best $L$ combinations from $\{\hat{x}^j + c_m(k)\}$, $j \in 1, \dots, L$, $k \in 1, \dots, K$, by minimizing the following objective function:

$$\| x - \hat{x}^j - c_m(k) \|^2 = \| x - \hat{x}^j \|^2 + \| x - c_m(k) \|^2 - \|x\|^2 + 2\, c_m(k)^{\top} \hat{x}^j \qquad (3)$$

We enumerate the $KL$ combinations and select the top $L$ candidates. For each combination in Equation 3:

• The first term has been computed at the previous encoding step: one table lookup.

• The second term $\| x - c_m(k) \|^2$ is pre-computed for each vector being encoded, taking $O(dK)$ time: one table lookup.

• The third term $\|x\|^2$ is a negligible constant.

• The last term involves $m$ table lookups and additions, with the inner products of all dictionary elements precomputed before the beam search procedure.

To sum up, the time complexity is $O(dK + mKL + KL \log L)$ for encoding with a single dictionary. Note that for DA learned from scratch, we do not need to re-encode with the previously learned dictionaries after we have optimized a "zero" dictionary (i.e., learned a new dictionary). We report the $L$-distortion curve in Figure 4(c); we found that a relatively low $L = 10$ already achieves satisfactory encoding quality. We use this configuration in the rest of the experiments.
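The beam search described above can be sketched in a few lines. For readability this version recomputes every candidate's error directly instead of assembling it from the lookup tables of Equation 3, so it does not have the stated time complexity; the function name `multipath_encode` and the toy dictionaries are assumptions made for illustration.

```python
import heapq


def sq_dist(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def multipath_encode(x, dictionaries, L=10):
    """Beam search over dictionaries, keeping the best L partial encodings.

    Returns (code, quantization error) of the best full encoding found.
    """
    zero = [0.0] * len(x)
    beams = [(sq_dist(x, zero), [], zero)]   # (error, code, approximation)
    for words in dictionaries:
        candidates = []
        for _, code, approx in beams:
            for k, w in enumerate(words):
                new_approx = [a + wi for a, wi in zip(approx, w)]
                candidates.append((sq_dist(x, new_approx), code + [k], new_approx))
        # Keep the L combinations with the smallest error (sorted ascending).
        beams = heapq.nsmallest(L, candidates, key=lambda c: c[0])
    best_error, best_code, _ = beams[0]
    return best_code, best_error


# Toy case where the greedy choice (L = 1) is not the best overall.
dictionaries = [[[2.0], [4.5]],
                [[0.5], [-1.5]]]
print(multipath_encode([3.0], dictionaries, L=1))
print(multipath_encode([3.0], dictionaries, L=2))
```

With $L = 1$ the search commits to the word nearest to $x$ in the first dictionary and ends with a nonzero error, while $L = 2$ keeps the second-best prefix alive and recovers the exact reconstruction.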

Online Dictionary Learning 

DA can be easily extended to an online learning mechanism to utilize even larger-scale datasets, where clustering on all data could be prohibitive, or where new data is not yet available. Online learning with DA can be done simply by optimizing the learned dictionaries to fit the newly arriving data. We report the online learning result for the SIFT1B (Jegou et al. 2011) dataset in Figure 4(a).

Figure 6: A toy illustration of the structure of A-Tree.


Aggregating Tree 

We are now able to apply the branch-and-bound methodology to high-dimensional data with the Aggregating Tree (A-Tree). After obtaining HCLAEs with DA, the A-Tree is constructed from the encodings like a radix tree (in which each node that is an only child is merged with its parent), except that we only merge leaf nodes. A-Tree effectively represents the quantized dataset, with all encodings written directly on the tree. A demonstrative structure of A-Tree is shown in Figure 6.

To perform non-exhaustive search on an A-Tree, the idea is to maintain a candidate list as in multi-path encoding. First we fix a size limit on the candidate list for each layer: $L_1, \dots, L_M$. Given a query vector $q$, we start with an initial candidate list containing only the root node, and iteratively do the following $M$ times (the procedure is illustrated in Figure 5):

1. Replace the nodes in the candidate list with their children. If a node has no children, it stays in the candidate list.

2. If the size of the candidate list exceeds $L_i$ ($i$ is the current iteration number), shrink it to $L_i$ by discarding the nodes farthest from the query vector.
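The two steps above can be sketched on a plain prefix tree built from the encodings. This is a simplified illustration under stated assumptions: leaf chains are not merged as in the actual A-Tree, each node stores its full reconstruction, and distances are computed directly rather than incrementally. The class and function names (`Node`, `build_tree`, `search`) are hypothetical.

```python
import heapq


def sq_dist(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


class Node:
    def __init__(self, approx):
        self.approx = approx    # sum of the dictionary words along the path
        self.children = {}      # code index -> child Node
        self.ids = []           # ids of the vectors whose code ends here


def build_tree(codes, dictionaries):
    """Insert every code into a prefix tree (no leaf merging in this sketch)."""
    root = Node([0.0] * len(dictionaries[0][0]))
    for vid, code in enumerate(codes):
        node = root
        for m, k in enumerate(code):
            if k not in node.children:
                word = dictionaries[m][k]
                node.children[k] = Node([a + w for a, w in zip(node.approx, word)])
            node = node.children[k]
        node.ids.append(vid)
    return root


def search(root, q, limits):
    """Layer-by-layer search; limits[i] is the candidate list size L_{i+1}."""
    candidates = [root]
    for limit in limits:
        expanded = []
        for node in candidates:
            # Step 1: replace a node by its children; childless nodes stay.
            expanded.extend(node.children.values() if node.children else [node])
        if len(expanded) > limit:
            # Step 2: keep only the `limit` nodes closest to the query.
            expanded = heapq.nsmallest(limit, expanded,
                                       key=lambda n: sq_dist(q, n.approx))
        candidates = expanded
    candidates.sort(key=lambda n: sq_dist(q, n.approx))
    return [vid for node in candidates for vid in node.ids]


dictionaries = [[[0.0], [10.0]], [[0.0], [1.0]]]
codes = [[0, 0], [0, 1], [1, 0], [1, 1]]      # reconstructions 0, 1, 10, 11
tree = build_tree(codes, dictionaries)
print(search(tree, [10.8], limits=[2, 2]))
```

Because the candidate list doubles as the ranked result, candidate listing and re-ranking happen in the same pass over the tree.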

We have to record some extra information on each node to allow fast distance computation. Let $m$ denote the depth of a node $T$, and let $p_1, \dots, p_m$ be the path from the root to this node. We record

$$\epsilon = c_m(p_m)^{\top} \sum_{i=1}^{m-1} c_i(p_i)$$

for $T$. When we compute the distance between $q$ and $T$ (reconstructed as $\sum_{i=1}^{m} c_i(p_i)$), we already know the distance between $q$ and $T$'s parent $T'$, so we have:

$$\| q - T \|^2 = \| q - T' \|^2 + \| q - c_m(p_m) \|^2 - \|q\|^2 + 2\, c_m(p_m)^{\top} T', \qquad c_m(p_m)^{\top} T' = c_m(p_m)^{\top} \sum_{i=1}^{m-1} c_i(p_i) = \epsilon$$

Thus the distance computation between a node and the query can be done efficiently in $O(1)$.²

²We can further reduce the number of additions and table lookups with a careful implementation; please refer to the supplementary material.
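The incremental update can be verified with a short sketch that walks down one root-to-node path. Here the per-node scalar $\epsilon$ is computed on the fly, whereas in the tree it would be stored on the node, and the distances from the query to the dictionary words would come from a per-query lookup table; the function name `path_distances` is hypothetical.

```python
def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def sq_dist(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def path_distances(q, words):
    """Distances from q to every node on a path, updated incrementally.

    `words` lists c_1(p_1), ..., c_m(p_m) along the path from the root.
    """
    q_norm = dot(q, q)
    dist = q_norm                     # the root reconstructs to the zero vector
    prefix = [0.0] * len(q)           # reconstruction of the parent node T'
    out = []
    for w in words:
        eps = dot(w, prefix)          # the scalar recorded on the node
        # ||q - T||^2 = ||q - T'||^2 + ||q - c_m||^2 - ||q||^2 + 2 * eps
        dist = dist + sq_dist(q, w) - q_norm + 2.0 * eps
        prefix = [p + wi for p, wi in zip(prefix, w)]
        out.append(dist)
    return out


q = [1.0, 2.0]
words = [[0.5, 0.5], [0.25, 1.0], [0.0, 0.25]]
print(path_distances(q, words))
```

Each update uses one stored scalar, one table lookup, and a constant number of additions, independent of the dimension $d$.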







Figure 4: Empirical analysis of Dictionary Annealing. Panel titles: (b) Training Time vs Distortion, (c) Multi-path Encoding. 4(a) Online learning with Dictionary Annealing further lowers the quantization error. Experiment on SIFT1B with K = 256, M = 8, feeding DA with 100,000 new vectors per batch. 4(b) Training time vs. distortion on GIST1M: whenever we obtain usable dictionaries for encoding, we encode the dataset and record the time. Note: OPQ is initialized with PQ, and DA continues optimization with online learning. The experiment runs on one GTX980 GPU. 4(c) Encoding the GIST1M dataset with Multi-path Encoding: we use DA-online dictionaries to examine how L affects distortion.

Figure 5: A toy demonstration of searching with A-Tree. The orange nodes are the current candidate list. The blue nodes are ready to enter the candidate list. The nodes closest to the query vector are selected for the new candidate list.


After the above steps we have obtained the list of approximate nearest neighbors. The configuration of $L_i$ has an influence on the final search quality, which will be discussed in the Experiments section. Candidate listing with A-Tree is highly efficient: the overall time complexity is $O(N' + dKM)$, where $N'$ refers to the total number of nodes traversed. Note that since A-Tree is a tree structure, its performance depends heavily on the implementation.

Experiments 

In this section we present the experimental evaluation of Dictionary Annealing and A-Tree. All experiments are done on a Core i7 running at 3.5 GHz with 16 GB of memory, single-threaded.

Datasets 

We use the following datasets, which are commonly used to validate the efficiency of ANN methods: SIFT1M (Jegou, Douze, and Schmid 2011), which contains one million 128-d SIFT (Lowe 2004) features and 10,000 queries; GIST1M (Jegou, Douze, and Schmid 2011), which contains one million 960-d GIST (Oliva and Torralba 2001) global descriptors and 1,000 queries; and SIFT1B (Jegou et al. 2011), which contains one billion 128-d SIFT features as base vectors and 10K queries.

Performance of Dictionary Annealing

We compare the following state-of-the-art encodings: Optimized Product Quantization (OPQ), Composite Quantization (CQ) (Ting Zhang 2014), Additive Quantization (AQ) (Babenko and Lempitsky 2014), and Tree Quantization (TQ) together with its optimized version, Optimized Tree Quantization (Babenko and Lempitsky 2015). We re-implemented AQ and OPQ ourselves, and reproduce the results from (Ting Zhang 2014) and (Babenko and Lempitsky 2015) to present the evaluation. We choose the commonly used configuration: K = 256 as the dictionary size and M = 8, 16 for all methods.

We use SIFT1M and GIST1M for evaluation; we train all methods on the training set and encode the whole dataset. We also train online DA with all the data³ and report the training time vs. distortion graph in Figure 4(b): DA runs almost as fast as RVQ and much faster than AQ. The quantization error is presented in Table 1; DA has a much lower quantization error than the other state-of-the-art methods. We perform exhaustive NN search and report the performance of different methods in Figure 8. It can be seen that DA consistently performs better than other state-of-the-art methods. Its online learning version pushes the performance of the encodings even higher, for example with 13.6% lower distortion and 23.07% higher Recall@1 for NN search on the 8-byte SIFT1M encoding.

³We did not train the other methods on the whole dataset because they require too much memory.

Methods | 8B-SIFT1M | 16B-SIFT1M | 8B-GIST1M | 16B-GIST1M
AQ | 19196.26 | 9799.86 | 0.6785 | 0.5277
OPQ | 22239.78 | 10468.39 | 0.6973 | 0.5361
PQ | 23540.75 | 10534.82 | 0.7056 | 0.6976
TQ | (about 20000) | (about 10000) | - | -
offline-DA | 18416.55 | 9444.11 | 0.6456 | 0.4847
online-DA | 16573.20 | 5901.43 | 0.6201 | 0.4583

Table 1: A comparison of quantization error between different quantization methods. We used K = 256, M = 8/16 for all methods. Values in brackets are from (Babenko and Lempitsky 2015).

Figure 8: Comparison of Recall vs. number of items retrieved with 64-bit encodings.

Searching with Aggregating Tree

Now we evaluate the performance of the Aggregating Tree. We constructed an A-Tree for SIFT1B (DA-online with 10M vectors of the dataset, M = 8, K = 256). We designed the A-Tree to be computationally efficient⁴. The resulting data structure occupies 14.53 GB of memory (1,224,574,028 nodes in total, consisting of 988,853,094 leaf nodes and 235,720,934 internal nodes) for SIFT1B with 64-bit encodings, including vector IDs.

The choice of $L_i$ is important for searching with A-Tree. The encodings produced by DA do not always guarantee that the local vectors have exactly the same prefix. In our experiments we let $L_i$ grow exponentially with the layer index $i$ from a base size $L_0$. Figure 7(b) reports the number of nodes traversed: although $L_i$ grows exponentially, the total number of traversed nodes is limited. We also report the performance of an exhaustive ADC (7.25 s per query) on the whole dataset. A-Tree approaches the search quality of exhaustive ADC with orders-of-magnitude acceleration, as shown in Figure 7(a). One can use a longer encoding for more precise search results. Finally, we draw the performance curve of A-Tree in Figure 7(c). A-Tree achieves a speed of 0.63 ms per query with a high search quality of 74.51% Recall@100, at the elbow of the curve.

Figure 7: Non-exhaustive search on A-Tree. The candidate list size for each layer is $L_i$, derived from $L_0 = 1, 2, 4, \dots, 256$. On the last layer we shrink the list to 100 and check whether the true nearest neighbor is in the list. Panels: (a) Search quality, (b) Number of nodes visited per query, (c) Performance curve (axis: time in ms).

In Table 2 we compare A-Tree with our speed-optimized implementations of IVFADC (Jegou, Douze, and Schmid 2011), Locally Optimized Product Quantization (LOPQ) (Kalantidis and Avrithis 2014), and Multi-D-ADC and Multi-ADC (Babenko and Lempitsky 2012). A-Tree achieves a 9.5x acceleration over Multi-D-ADC and an over 117x acceleration over IVFADC with comparable performance. We think this is mainly because:

1. A-Tree joins the candidate listing and re-ranking procedures together to avoid excessive "pre-computation". This also makes A-Tree cache friendly, whereas other methods require re-calculating the look-up table many times and are cache unfriendly.

2. A-Tree is based on HCLAE, so a shorter candidate list can already achieve satisfactory results, whereas for IVFADC a typical candidate list length is 80M on the SIFT1B dataset.

3. DA produces a high-quality encoded dataset, especially with online learning (Recall@100: 0.834 at 64 bits, compared to Composite Quantization (Ting Zhang 2014): ~0.7, OPQ: ~0.65, PQ: ~0.55).

System | Recall@1 | Recall@100 | Query Time
IVFADC | (a0.S§) 0.107 | (0.755) 0.729 | (74ms) 65ms
Multi-D-ADC | (0.158) 0.U9 | (0.700) 0.717 | (6ms) 3.4ms
Multi-ADC | (~0.05) 0.064 | (~0.6) 0.582 | 3.2ms
LOPQ | (0.799) 0.182 | (0.909) 0.890 | 69ms
A-Tree | 0.137 | 0.7451 | 0.63ms

Table 2: Comparison of various non-exhaustive search methods. For Multi-D-ADC and Multi-ADC, K = 2^ and T = 10000. Results in brackets are taken from (Babenko and Lempitsky 2012) and (Kalantidis and Avrithis 2014).

⁴Implementation details are presented in the supplementary materials.

Conclusion 

In this paper, we introduced the concept of High Capacity Locally Aggregating Encodings (HCLAE) for ANN search. We proposed Dictionary Annealing to produce HCLAE, and the Aggregating Tree to perform fast non-exhaustive search. Empirical results on datasets commonly used for evaluating ANN search methods demonstrate that our proposed approach significantly outperforms existing methods.